In this project, we are using a dataset of songs on the music streaming app Spotify.
The dataset contains songs on Spotify across multiple genres, and we will be performing several analyses on this dataset: basic descriptive and bivariate statistics, Principal Component Analysis, decision trees, regression, and clustering.
Here is the link to the original dataset
First, we import the data to R and make sure R is reading the data properly.
# importing relevant libraries to perform cleaning on the data
library(tidyverse)
library(janitor)
setwd("~/Documents/class/stats-final-project/")
# importing the data and cleaning the names into a snake_case format.
raw_data <- read.csv("dataset.csv") %>% clean_names()
The dataset has 114,000 rows and 21 columns/variables.
It has the following scores (numerical variables):

- popularity: The popularity of a track is a value between 0 and 100, with 100 being the most popular.
- duration_ms: The track length in milliseconds.
- danceability: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- energy: A measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy; for example, death metal has high energy, while a Bach prelude scores low on the scale.
- loudness: The overall loudness of a track in decibels (dB).
- speechiness: Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer the value is to 1.0.
- acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic; 1.0 represents high confidence that the track is acoustic.
- instrumentalness: Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context, while rap or spoken word tracks are clearly "vocal". The closer the value is to 1.0, the greater the likelihood the track contains no vocal content.
- liveness: Detects the presence of an audience in the recording. Higher values represent an increased probability that the track was performed live; a value above 0.8 provides strong likelihood that the track is live.
- valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
The dataset also has the following categorical variables:

- explicit: Whether or not the track has explicit lyrics (true = yes it does; false = no it does not, or unknown).
- mode: Indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0.
- key: The key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
- track_genre: The genre to which the track belongs.
It also has the following columns that describe the songs:

- track_id: The Spotify ID for the track.
- artists: The names of the artists who performed the track. If there is more than one artist, they are separated by a ;.
- album_name: The name of the album on which the track appears.
- track_name: The name of the track.
Now we perform several checks on the data.
#dim(raw_data)
#names(raw_data)
#sapply(raw_data, class)
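The checks above are commented out in the knit. A minimal sketch of what automated versions could look like, run here on a small toy data frame standing in for raw_data (the column names and ranges follow the data dictionary above; the values are made up):

```r
# toy stand-in for raw_data, since the CSV isn't bundled with this document
toy <- data.frame(
  popularity   = c(73, 55, 0),
  duration_ms  = c(230666, 149610, 210826),
  danceability = c(0.676, 0.420, 0.438)
)
# expected columns are present
stopifnot(all(c("popularity", "duration_ms", "danceability") %in% names(toy)))
# score columns are numeric and within their documented ranges
stopifnot(is.numeric(toy$popularity),
          all(toy$popularity >= 0 & toy$popularity <= 100))
stopifnot(all(toy$danceability >= 0 & toy$danceability <= 1))
cat("all checks passed\n")
```

Running the same stopifnot() calls against raw_data would make the import fail loudly if the file ever changes shape.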
We’ll perform the following transformations on the data in order to prepare it for our analysis.

Since some of the categorical variables - mode, key, and time_signature - are currently coded as numerical, we will do the following:

1. key: Convert the numbers 0 to 11 to the letter value of the key (C = 0, C# = 1, etc.).
2. mode: Instead of 0 for minor and 1 for major, convert the values to "major" and "minor".
3. time_signature: Convert to character instead of numeric.

We are also adding 2 more variables:

1. multiple_artists: If there are multiple artists performing the track, the artists column contains all artists separated by a semicolon (;). This column is true if there are multiple artists, and false for a single artist.
2. tempo_cat: A categorical variable based on the tempo column. We use the beats per minute to determine which tempo marking the track fits in. This is an ordinal variable, with levels ranging from Larghissimo (slowest) to Prestissimo (fastest).
We are also performing several filters to scope our analysis:

1. Filter to songs by popular artists. This is done by finding the artists that have 20 songs or more and filtering to just the songs by those artists.
2. Scope the analysis to songs that are less than 10 minutes long.
3. Remove duplicated songs, because some songs are listed in both album and single versions. This is done by removing songs that have the same values in the track_name and artists columns.
4. Sample the data down to just 3,000 rows/songs, by random sampling using the sample_n() function.

Finally, we select only the columns that are relevant to our analysis, removing the descriptive columns track_id, artists, album_name, and track_name.
# creating a list of keys from C to B
key_alpha <- c('C','C#/Db','D','D#/Eb','E','F','F#/Gb','G','G#/Ab','A','A#/Bb','B')
# creating a new data frame for mapping numeric keys to letter names
key_map <- data.frame(key = 0:11,
                      key_alpha = key_alpha)
data <- raw_data %>%
  full_join(key_map, by = "key") %>%
  mutate(
    # recoding mode from 0/1 to "minor"/"major"
    mode = if_else(mode == 0, "minor", "major"),
    time_signature = as.character(time_signature),
    # converting numeric keys to their letter names
    key = key_alpha,
    # TRUE if multiple artists perform the track (names separated by ";"), FALSE for a single artist
    multiple_artists = grepl(";", artists)
  ) %>%
  # dropping the first (index) column and the now-redundant key_alpha helper column
  select(-1, -22) %>%
  # removing duplicated songs, because some tracks appear in multiple albums
  distinct(track_name, artists, .keep_all = TRUE)
# finding the popular artists, with 20 or more songs listed
popular_artists <- data %>%
  group_by(artists) %>%
  summarize(count = n()) %>%
  filter(count >= 20)
filtered <- data %>%
  filter(artists %in% popular_artists$artists) %>%
  # keeping only songs shorter than 10 minutes (600,000 ms)
  filter(duration_ms <= 600000) %>%
  # removing track_id, artists, album_name and track_name, since they're no longer needed
  select(5:21) %>%
  mutate(
    # adding an ordinal variable for the tempo marking
    tempo_cat = cut(tempo,
                    breaks = c(0, 20, 40, 60, 66, 76, 108, 120, 168, 176, 200, 1000),
                    labels = c('Larghissimo', 'Grave', 'Lento/Largo', 'Larghetto',
                               'Adagio', 'Andante', 'Moderato', 'Allegro',
                               'Vivace', 'Presto', 'Prestissimo'))
  )
# random sampling the data to just 3000 songs
set.seed(1)
dd <- filtered %>% sample_n(3000)
# attaching the data frame so columns can be referenced directly by name
attach(dd)
# for one-time exporting for other analysis
# dd %>% write_csv("cleaneddata.csv")
n <- dim(dd)[1]
K <- dim(dd)[2]
descriptiva <- function(X, nom){
  if (!(is.numeric(X) || class(X) == "Date")){
    frecs <- table(as.factor(X), useNA = "ifany")
    proportions <- frecs / n
    # note: decide whether to compute percentages with or without missing values
    pie(frecs, cex = 0.6, main = paste("Pie of", nom))
    barplot(frecs, las = 3, cex.names = 0.7, main = paste("Barplot of", nom), col = listOfColors)
    print(paste("Number of modalities: ", length(frecs)))
    print("Frequency table")
    print(frecs)
    print("Relative frequency table (proportions)")
    print(proportions)
    print("Frequency table sorted")
    print(sort(frecs, decreasing = TRUE))
    print("Relative frequency table (proportions) sorted")
    print(sort(proportions, decreasing = TRUE))
  } else {
    if (class(X) == "Date"){
      print(summary(X))
      print(sd(X))
      # decide breaks: weeks, months, quarters...
      hist(X, breaks = "weeks")
    } else {
      hist(X, main = paste("Histogram of", nom))
      boxplot(X, horizontal = TRUE, main = paste("Boxplot of", nom))
      print("Extended Summary Statistics")
      print(summary(X))
      print(paste("sd: ", sd(X, na.rm = TRUE)))
      print(paste("vc: ", sd(X, na.rm = TRUE) / mean(X, na.rm = TRUE)))
    }
  }
}
dataset <- dd
# all K columns are active for the descriptive run
actives <- c(1:K)
listOfColors <- rainbow(39)
par(ask = TRUE)
for (k in actives){
  print(paste("variable ", k, ":", names(dd)[k]))
  descriptiva(dd[, k], names(dd)[k])
}
[1] "variable 1 : popularity"
[1] "Extended Summary Statistics"
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 19.00 30.00 32.78 46.00 97.00
[1] "sd: 19.071854500815"
[1] "vc: 0.581754585688306"
[1] "variable 2 : duration_ms"
[1] "Extended Summary Statistics"
Min. 1st Qu. Median Mean 3rd Qu. Max.
28946 167156 215889 224103 268796 594533
[1] "sd: 86933.4740717619"
[1] "vc: 0.387917069526423"
[1] "variable 3 : explicit"
[1] "Number of modalities: 2"
[1] "Frequency table"
False True
2812 188
[1] "Relative frequency table (proportions)"
False True
0.93733333 0.06266667
[1] "Frequency table sorted"
False True
2812 188
[1] "Relative frequency table (proportions) sorted"
False True
0.93733333 0.06266667
[1] "variable 4 : danceability"
[1] "Extended Summary Statistics"
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.4270 0.5480 0.5410 0.6653 0.9750
[1] "sd: 0.176265638860442"
[1] "vc: 0.325833140160227"
[1] "variable 5 : energy"
[1] "Extended Summary Statistics"
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000242 0.423000 0.666500 0.624997 0.861250 0.999000
[1] "sd: 0.265941258309461"
[1] "vc: 0.42550798946771"
[1] "variable 6 : key"
[1] "Number of modalities: 12"
[1] "Frequency table"
A A#/Bb B C C#/Db D D#/Eb E F F#/Gb G G#/Ab
326 187 221 366 250 346 90 269 252 160 383 150
[1] "Relative frequency table (proportions)"
A A#/Bb B C C#/Db D D#/Eb E F F#/Gb
0.10866667 0.06233333 0.07366667 0.12200000 0.08333333 0.11533333 0.03000000 0.08966667 0.08400000 0.05333333
G G#/Ab
0.12766667 0.05000000
[1] "Frequency table sorted"
G C D A E F C#/Db B A#/Bb F#/Gb G#/Ab D#/Eb
383 366 346 326 269 252 250 221 187 160 150 90
[1] "Relative frequency table (proportions) sorted"
G C D A E F C#/Db B A#/Bb F#/Gb
0.12766667 0.12200000 0.11533333 0.10866667 0.08966667 0.08400000 0.08333333 0.07366667 0.06233333 0.05333333
G#/Ab D#/Eb
0.05000000 0.03000000
[1] "variable 7 : loudness"
[1] "Extended Summary Statistics"
Min. 1st Qu. Median Mean 3rd Qu. Max.
-42.631 -11.155 -7.498 -8.888 -5.181 0.377
[1] "sd: 5.32258438263559"
[1] "vc: -0.598842498009395"
[1] "variable 8 : mode"
[1] "Number of modalities: 2"
[1] "Frequency table"
major minor
2022 978
[1] "Relative frequency table (proportions)"
major minor
0.674 0.326
[1] "Frequency table sorted"
major minor
2022 978
[1] "Relative frequency table (proportions) sorted"
major minor
0.674 0.326
[1] "variable 9 : speechiness"
[1] "Extended Summary Statistics"
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.03470 0.04690 0.07862 0.07490 0.96200
[1] "sd: 0.103964236837045"
[1] "vc: 1.32239625374972"
[1] "variable 10 : acousticness"
[1] "Extended Summary Statistics"
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000001 0.013300 0.217000 0.348478 0.678250 0.996000
[1] "sd: 0.349889758582825"
[1] "vc: 1.00405207937953"
[1] "variable 11 : instrumentalness"
[1] "Extended Summary Statistics"
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000000 0.0000000 0.0000889 0.1887061 0.1462500 1.0000000
[1] "sd: 0.33859412273214"
[1] "vc: 1.7942934790234"
[1] "variable 12 : liveness"
[1] "Extended Summary Statistics"
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0112 0.1010 0.1410 0.2379 0.3070 0.9920
[1] "sd: 0.216531534653923"
[1] "vc: 0.910165401744503"
[1] "variable 13 : valence"
[1] "Extended Summary Statistics"
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.2500 0.4730 0.4771 0.6933 0.9850
[1] "sd: 0.267153566864831"
[1] "vc: 0.559899974951364"
[1] "variable 14 : tempo"
[1] "Extended Summary Statistics"
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 100.0 121.9 121.9 139.7 220.0
[1] "sd: 29.448628745521"
[1] "vc: 0.241613772469231"
[1] "variable 15 : time_signature"
[1] "Number of modalities: 5"
[1] "Frequency table"
0 1 3 4 5
4 30 271 2636 59
[1] "Relative frequency table (proportions)"
0 1 3 4 5
0.001333333 0.010000000 0.090333333 0.878666667 0.019666667
[1] "Frequency table sorted"
4 3 5 1 0
2636 271 59 30 4
[1] "Relative frequency table (proportions) sorted"
4 3 5 1 0
0.878666667 0.090333333 0.019666667 0.010000000 0.001333333
[1] "variable 16 : track_genre"
[1] "Number of modalities: 100"
[1] "Frequency table"
acoustic afrobeat alt-rock alternative ambient anime
23 41 46 8 41 32
black-metal bluegrass blues brazil breakbeat british
17 41 2 19 36 71
cantopop chicago-house children chill classical club
51 67 98 6 22 24
comedy country dance dancehall death-metal detroit-techno
32 5 2 14 21 79
disco disney drum-and-bass edm electro electronic
4 43 4 3 11 19
emo folk forro garage german gospel
30 18 50 33 24 14
goth grindcore groove grunge guitar happy
47 49 34 44 31 30
hard-rock hardcore heavy-metal hip-hop honky-tonk idm
45 12 103 8 109 61
indian indie indie-pop industrial iranian j-dance
9 1 2 47 44 26
j-idol j-pop j-rock jazz k-pop kids
110 12 16 2 37 84
latin latino malay mandopop metal metalcore
1 3 36 23 14 30
minimal-techno mpb new-age opera pagode party
6 16 63 20 59 20
piano pop pop-film power-pop progressive-house psych-rock
28 1 3 28 4 37
punk punk-rock r-n-b rock-n-roll rockabilly romance
5 17 11 35 29 53
salsa samba sertanejo show-tunes singer-songwriter ska
21 24 44 18 8 40
sleep spanish study swedish synth-pop tango
34 11 74 7 18 45
trance trip-hop turkish world-music
10 27 7 56
[1] "Relative frequency table (proportions)"
acoustic afrobeat alt-rock alternative ambient anime
0.0076666667 0.0136666667 0.0153333333 0.0026666667 0.0136666667 0.0106666667
black-metal bluegrass blues brazil breakbeat british
0.0056666667 0.0136666667 0.0006666667 0.0063333333 0.0120000000 0.0236666667
cantopop chicago-house children chill classical club
0.0170000000 0.0223333333 0.0326666667 0.0020000000 0.0073333333 0.0080000000
comedy country dance dancehall death-metal detroit-techno
0.0106666667 0.0016666667 0.0006666667 0.0046666667 0.0070000000 0.0263333333
disco disney drum-and-bass edm electro electronic
0.0013333333 0.0143333333 0.0013333333 0.0010000000 0.0036666667 0.0063333333
emo folk forro garage german gospel
0.0100000000 0.0060000000 0.0166666667 0.0110000000 0.0080000000 0.0046666667
goth grindcore groove grunge guitar happy
0.0156666667 0.0163333333 0.0113333333 0.0146666667 0.0103333333 0.0100000000
hard-rock hardcore heavy-metal hip-hop honky-tonk idm
0.0150000000 0.0040000000 0.0343333333 0.0026666667 0.0363333333 0.0203333333
indian indie indie-pop industrial iranian j-dance
0.0030000000 0.0003333333 0.0006666667 0.0156666667 0.0146666667 0.0086666667
j-idol j-pop j-rock jazz k-pop kids
0.0366666667 0.0040000000 0.0053333333 0.0006666667 0.0123333333 0.0280000000
latin latino malay mandopop metal metalcore
0.0003333333 0.0010000000 0.0120000000 0.0076666667 0.0046666667 0.0100000000
minimal-techno mpb new-age opera pagode party
0.0020000000 0.0053333333 0.0210000000 0.0066666667 0.0196666667 0.0066666667
piano pop pop-film power-pop progressive-house psych-rock
0.0093333333 0.0003333333 0.0010000000 0.0093333333 0.0013333333 0.0123333333
punk punk-rock r-n-b rock-n-roll rockabilly romance
0.0016666667 0.0056666667 0.0036666667 0.0116666667 0.0096666667 0.0176666667
salsa samba sertanejo show-tunes singer-songwriter ska
0.0070000000 0.0080000000 0.0146666667 0.0060000000 0.0026666667 0.0133333333
sleep spanish study swedish synth-pop tango
0.0113333333 0.0036666667 0.0246666667 0.0023333333 0.0060000000 0.0150000000
trance trip-hop turkish world-music
0.0033333333 0.0090000000 0.0023333333 0.0186666667
[1] "Frequency table sorted"
j-idol honky-tonk heavy-metal children kids detroit-techno
110 109 103 98 84 79
study british chicago-house new-age idm pagode
74 71 67 63 61 59
world-music romance cantopop forro grindcore goth
56 53 51 50 49 47
industrial alt-rock hard-rock tango grunge iranian
47 46 45 45 44 44
sertanejo disney afrobeat ambient bluegrass ska
44 43 41 41 41 40
k-pop psych-rock breakbeat malay rock-n-roll groove
37 37 36 36 35 34
sleep garage anime comedy guitar emo
34 33 32 32 31 30
happy metalcore rockabilly piano power-pop trip-hop
30 30 29 28 28 27
j-dance club german samba acoustic mandopop
26 24 24 24 23 23
classical death-metal salsa opera party brazil
22 21 21 20 20 19
electronic folk show-tunes synth-pop black-metal punk-rock
19 18 18 18 17 17
j-rock mpb dancehall gospel metal hardcore
16 16 14 14 14 12
j-pop electro r-n-b spanish trance indian
12 11 11 11 10 9
alternative hip-hop singer-songwriter swedish turkish chill
8 8 8 7 7 6
minimal-techno country punk disco drum-and-bass progressive-house
6 5 5 4 4 4
edm latino pop-film blues dance indie-pop
3 3 3 2 2 2
jazz indie latin pop
2 1 1 1
[1] "Relative frequency table (proportions) sorted"
j-idol honky-tonk heavy-metal children kids detroit-techno
0.0366666667 0.0363333333 0.0343333333 0.0326666667 0.0280000000 0.0263333333
study british chicago-house new-age idm pagode
0.0246666667 0.0236666667 0.0223333333 0.0210000000 0.0203333333 0.0196666667
world-music romance cantopop forro grindcore goth
0.0186666667 0.0176666667 0.0170000000 0.0166666667 0.0163333333 0.0156666667
industrial alt-rock hard-rock tango grunge iranian
0.0156666667 0.0153333333 0.0150000000 0.0150000000 0.0146666667 0.0146666667
sertanejo disney afrobeat ambient bluegrass ska
0.0146666667 0.0143333333 0.0136666667 0.0136666667 0.0136666667 0.0133333333
k-pop psych-rock breakbeat malay rock-n-roll groove
0.0123333333 0.0123333333 0.0120000000 0.0120000000 0.0116666667 0.0113333333
sleep garage anime comedy guitar emo
0.0113333333 0.0110000000 0.0106666667 0.0106666667 0.0103333333 0.0100000000
happy metalcore rockabilly piano power-pop trip-hop
0.0100000000 0.0100000000 0.0096666667 0.0093333333 0.0093333333 0.0090000000
j-dance club german samba acoustic mandopop
0.0086666667 0.0080000000 0.0080000000 0.0080000000 0.0076666667 0.0076666667
classical death-metal salsa opera party brazil
0.0073333333 0.0070000000 0.0070000000 0.0066666667 0.0066666667 0.0063333333
electronic folk show-tunes synth-pop black-metal punk-rock
0.0063333333 0.0060000000 0.0060000000 0.0060000000 0.0056666667 0.0056666667
j-rock mpb dancehall gospel metal hardcore
0.0053333333 0.0053333333 0.0046666667 0.0046666667 0.0046666667 0.0040000000
j-pop electro r-n-b spanish trance indian
0.0040000000 0.0036666667 0.0036666667 0.0036666667 0.0033333333 0.0030000000
alternative hip-hop singer-songwriter swedish turkish chill
0.0026666667 0.0026666667 0.0026666667 0.0023333333 0.0023333333 0.0020000000
minimal-techno country punk disco drum-and-bass progressive-house
0.0020000000 0.0016666667 0.0016666667 0.0013333333 0.0013333333 0.0013333333
edm latino pop-film blues dance indie-pop
0.0010000000 0.0010000000 0.0010000000 0.0006666667 0.0006666667 0.0006666667
jazz indie latin pop
0.0006666667 0.0003333333 0.0003333333 0.0003333333
[1] "variable 17 : multiple_artists"
[1] "Number of modalities: 2"
[1] "Frequency table"
FALSE TRUE
2927 73
[1] "Relative frequency table (proportions)"
FALSE TRUE
0.97566667 0.02433333
[1] "Frequency table sorted"
FALSE TRUE
2927 73
[1] "Relative frequency table (proportions) sorted"
FALSE TRUE
0.97566667 0.02433333
[1] "variable 18 : tempo_cat"
[1] "Number of modalities: 12"
[1] "Frequency table"
Larghissimo Grave Lento/Largo Larghetto Adagio Andante Moderato Allegro Vivace
0 0 10 23 96 858 422 1335 119
Presto Prestissimo <NA>
116 17 4
[1] "Relative frequency table (proportions)"
Larghissimo Grave Lento/Largo Larghetto Adagio Andante Moderato Allegro Vivace
0.000000000 0.000000000 0.003333333 0.007666667 0.032000000 0.286000000 0.140666667 0.445000000 0.039666667
Presto Prestissimo <NA>
0.038666667 0.005666667 0.001333333
[1] "Frequency table sorted"
Allegro Andante Moderato Vivace Presto Adagio Larghetto Prestissimo Lento/Largo
1335 858 422 119 116 96 23 17 10
<NA> Larghissimo Grave
4 0 0
[1] "Relative frequency table (proportions) sorted"
Allegro Andante Moderato Vivace Presto Adagio Larghetto Prestissimo Lento/Largo
0.445000000 0.286000000 0.140666667 0.039666667 0.038666667 0.032000000 0.007666667 0.005666667 0.003333333
<NA> Larghissimo Grave
0.001333333 0.000000000 0.000000000
par(ask=FALSE)
# for exporting R figures programmatically
#dev.off()
After seeing the basic descriptive statistics of the data, we’ll do a bivariate statistics analysis. The purpose is to find relationships between:

1. Categorical vs categorical variables
2. Categorical vs numerical variables
3. Numerical vs numerical variables
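As a sketch of the base-R tools behind each of these pairings (illustrated on a small toy data frame rather than dd, so the numbers below are not the project's):

```r
set.seed(1)
# toy stand-in for dd with one variable of each type we need
toy <- data.frame(
  mode     = sample(c("major", "minor"), 50, replace = TRUE),
  explicit = sample(c(FALSE, TRUE), 50, replace = TRUE),
  valence  = runif(50),
  energy   = runif(50)
)
# 1. categorical vs categorical: a contingency table (a chi-squared test can follow)
print(table(toy$mode, toy$explicit))
# 2. categorical vs numerical: compare group means (boxplots show the same contrast)
print(tapply(toy$valence, toy$mode, mean))
# 3. numerical vs numerical: Pearson correlation
print(cor(toy$valence, toy$energy))
```

The same three calls, applied to dd's columns, underlie the charts and findings that follow.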
We examine the relationship between tempo marking and mode.
library(ggplot2)
# stacked bar chart of mode within each tempo marking
ggplot(dd, aes(x = tempo_cat, fill = mode)) +
  geom_bar(position = "stack")

# side-by-side (dodged) bars
ggplot(dd, aes(x = tempo_cat, fill = mode)) +
  geom_bar(position = "dodge")
# proportion (filled) bars, showing the major/minor share per tempo marking
ggplot(dd, aes(x = tempo_cat, fill = mode)) +
  geom_bar(position = "fill")
From the above bar charts, we see that while there are more songs in a major key overall, some tempo markings have a higher proportion of minor songs than others. Larghetto and Vivace are the two tempo markings with the highest proportion of minor songs, which is interesting because Larghetto is on the slower end and Vivace is on the faster end.
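This visual impression could be backed by a chi-squared test of independence between tempo_cat and mode; a sketch on made-up counts (the real test would use table(dd$tempo_cat, dd$mode)):

```r
# hypothetical counts shaped like a tempo_cat x mode table, NOT the real dd counts
counts <- matrix(c( 15,   8,
                    60,  59,
                   900, 435),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(tempo_cat = c("Larghetto", "Vivace", "Allegro"),
                                 mode      = c("major", "minor")))
# a small p-value would indicate the major/minor split differs across tempo markings
chisq.test(counts)
```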
Next, we can also examine the relationship between mode and explicitness.
From the plots above, there aren’t many explicit songs, so the relationship is hard to tell. The proportion of minor songs is slightly higher among explicit songs than among clean ones, but the difference is minimal.
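One way to put numbers on this is a column-proportion table; a sketch with made-up counts (the real version would be prop.table(table(dd$mode, dd$explicit), margin = 2)):

```r
# hypothetical mode x explicit counts, NOT the real dd counts
tab <- matrix(c(1900, 122,
                 912,  66),
              nrow = 2, byrow = TRUE,
              dimnames = list(mode = c("major", "minor"),
                              explicit = c("FALSE", "TRUE")))
# margin = 2 gives the major/minor split within clean and explicit songs separately
prop.table(tab, margin = 2)
```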
We can also examine the relationship between track_genre and mode.
We can see that some genres clearly stand out with a majority of minor songs. All latin songs are in the minor key. Synth-pop, turkish, trance, dancehall, romance, spanish, anime and hip-hop songs are also among the top 10 genres with a high proportion of minor songs.
So, there seems to be a relationship between genre and mode.
Explicit and genre also seem to be related, as explicit songs tend to come from certain genres. Latino songs are 100% explicit. Comedy, country, dance and some of the metal genres also tend to contain swear words.
Next we can examine the relationship between key and mode.
The key of B has the highest proportion of minor songs. F#/Gb, E and A#/Bb also have a relatively higher percentage of minor songs.
We can also analyze categorical vs numerical variables using charts such as multiple boxplots, which show the distribution of a numerical variable across the levels of a categorical variable.
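A minimal sketch of such a multiple boxplot in ggplot2, on a toy data frame standing in for dd (the real call would map dd's mode and valence columns):

```r
library(ggplot2)
set.seed(1)
# toy stand-in for dd's mode and valence columns
toy <- data.frame(mode    = rep(c("major", "minor"), each = 25),
                  valence = runif(50))
# one box per mode, showing how the valence distribution shifts between groups
ggplot(toy, aes(x = mode, y = valence)) +
  geom_boxplot()
```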
We have used the code by Dr. Karina Gibert to do an overview of the variables, and below are the interesting plots recreated in ggplot().
First, the code below creates functions to test numerical and qualitative variables.
Let’s run the profiling script for the mode variable
Findings:
From the boxplot of valence vs mode, we can see that minor songs tend to have lower valence (sadder mood) than major songs. Similarly, songs in the minor key tend to have lower tempo compared to songs in the major key, although there are outliers.
Interestingly, many songs in the major mode are in the key of G, C, D and A, whereas the most popular keys for minor songs are A, B and E.
For popularity, songs in the minor key have a wider range compared to major songs.
Genre vs valence.
One genre stood out when looking at the highest mean valence: r-n-b. The sleep genre has the lowest mean valence.
Next, we can examine the relationship between genre and energy by calculating the mean energy per genre.
Classical songs have the least mean energy, and drum-and-bass songs have the highest mean energy.
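The per-genre means behind this finding can be sketched with dplyr (toy data here, so the numbers are illustrative only, not the dataset's):

```r
library(dplyr)
# toy stand-in for dd with just the two columns this summary needs
toy <- data.frame(track_genre = rep(c("classical", "drum-and-bass"), each = 3),
                  energy      = c(0.05, 0.10, 0.08, 0.95, 0.90, 0.97))
# mean energy per genre, highest first
toy %>%
  group_by(track_genre) %>%
  summarize(mean_energy = mean(energy)) %>%
  arrange(desc(mean_energy))
```

Applied to dd, the same pipeline ranks all 100 genres by mean energy.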
We can also examine the relationship between genre and danceability.
Energy vs mode
Minor songs actually have a wider range of energy compared to major songs, which is interesting: one would expect more happy songs (typically in a major key) to have higher energy. This could be because many of the latin songs are in a minor key.
Looking at valence vs explicitness, we see that songs that are explicit tend to have lower valence than songs that are clean.